AML2019
Anomaly detection (AD) refers to the process of detecting data points that do not conform with the rest of observations. Applications of anomaly detection include fraud and fault detection, surveillance, diagnosis, data cleanup, predictive maintenance.
When we talk about AD, we usually look at it as an unsupervised (or semi-supervised) task, where the concept of anomaly is often not well defined or, in the best case, just few samples are labeled as anomalous. In this challenge, you will look at AD from a different perspective!
The dataset you are going to work on consists of monitoring data generated by IT systems; such data is then processed by a monitoring system that executes some checks and detects a series of anomalies. This is a multi-label classification problem, where each check is a binary label corresponding to a specific type of anomaly. Your goal is to develop a machine learning model (or multiple ones) to accurately detect such anomalies.
This will also involve a mixture of data exploration, pre-processing, model selection, and performance evaluation. You will also be asked to try one or more rule learning models, and compare them with other ML models both in terms of predictive performances and interpretability. Interpreatibility is indeed a strong requirement especially in applications like AD where understanding the output of a model is as important as the output itself.
Please, bear in mind that the purpose of this challenge is not simply to find the best-performing model. You should rather make sure to understand the difficulties that come with this AD task.
import numpy as np
import pandas as pd
import os
#####IMPORT ALL NEEDED LIBRABRIES#######
%matplotlib inline
import os
import sys
import re
import random
import matplotlib
import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')
import implicit
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
from time import time
os.environ["OPENBLAS_NUM_THREADS"] = "1" # Required by implicit
base = "/mnt/datasets/anomaly/"
os.listdir(base)
# Read the .csv file (might take a while)
df = pd.read_csv(base + 'data.csv',
sep=';',
header=None,
names=['SessionNumber', 'SystemID', 'Date', 'HighPriorityAlerts', 'Dumps', 'CleanupOOMDumps', 'CompositeOOMDums',
'IndexServerRestarts', 'NameServerRestarts', 'XSEngineRestarts', 'PreprocessorRestarts', 'DaemonRestarts',
'StatisticsServerRestarts', 'CPU', 'PhysMEM', 'InstanceMEM', 'TablesAllocation','IndexServerAllocationLimit',
'ColumnUnloads', 'DeltaSize', 'MergeErrors', 'BlockingPhaseSec', 'Disk', 'LargestTableSize',
'LargestPartitionSize', 'DiagnosisFiles', 'DiagnosisFilesSize', 'DaysWithSuccessfulDataBackups',
'DaysWithSuccessfulLogBackups', 'DaysWithFailedDataBackups', 'DaysWithFailedfulLogBackups',
'MinDailyNumberOfSuccessfulDataBackups', 'MinDailyNumberOfSuccessfulLogBackups',
'MaxDailyNumberOfFailedDataBackups', 'MaxDailyNumberOfFailedLogBackups', 'LogSegmentChange',
'Check1', 'Check2','Check3','Check4','Check5','Check6','Check7','Check8'])
df.head()
#Display the columns with the higher rate of missing values
df_na = (df.isnull().sum() / len(df)) * 100 #count the number of missing values
df_na = df_na.drop(df_na[df_na == 0].index).sort_values(ascending=False)[:30] #get rid of the column with no missing values
missing_data = pd.DataFrame({'Missing Ratio' :df_na}) #create the dataframe containing the column and its ratio of missing values
missing_data.sort_values('Missing Ratio', ascending=False).head() #display the 5 first results of this dataframe()
#look at the representation of Nan values in the dataframe
nb_nans_in_col=df.isnull().sum().to_frame() #number of Nans for each columns
totcells=df.shape[0]*df.shape[1] #number of cells we have in the dataframe
print('Nan values represent %s'%round(nb_nans_in_col.sum()[0]/totcells*100,2) +"% of all the data contained in our dataframe.")
#values for which all checks are NaN
checks=df.iloc[:,36:]
idx = checks.index[checks.isnull().all(1)]
nans = checks.iloc[idx]
print("NaN's shape :",nans.shape)
nans.head()
#Drop rows with all unknown checks
df.drop(nans.index, inplace = True)
print('Number of values different from 0 in PreprocessorRestarts =', (df['PreprocessorRestarts']!=0).sum())
print('Number of values different from 0 in DaemonRestarts =', (df['DaemonRestarts']!=0).sum())
print('Number of values different from 0 in CleanupOOMDumps =', (df['CleanupOOMDumps']!=0).sum())
print('Number of Nan values in CleanupOOMDumps =', len(df[np.isnan(df['CleanupOOMDumps'])==True]))
df.drop(columns=['PreprocessorRestarts','DaemonRestarts','CleanupOOMDumps'], inplace=True)
#drop columns with nan values
df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=True)
#look at outliers (data not in the definition domain)
#INTEGER [0, N] / count negative values (not in the definition domain)
if ((df['HighPriorityAlerts'] <0).sum() != 0): print((df['HighPriorityAlerts'] <0).sum(),'values not in the domain of definition')
if ((df['Dumps'] <0).sum() != 0):print((df['Dumps'] <0).sum(),'values not in the domain of definition')
# print((df['CleanupOOMDumps'] <0).sum()) #deleted columns
if ((df['CompositeOOMDums'] <0).sum() != 0):print((df['CompositeOOMDums'] <0).sum(),'values not in the domain of definition')
if ((df['IndexServerRestarts'] <0).sum() != 0):print((df['IndexServerRestarts'] <0).sum(),'values not in the domain of definition')
if ((df['NameServerRestarts'] <0).sum() != 0):print((df['NameServerRestarts'] <0).sum(),'values not in the domain of definition')
if ((df['XSEngineRestarts'] <0).sum() != 0):print((df['XSEngineRestarts'] <0).sum(),'values not in the domain of definition')
# print((df['PreprocessorRestarts'] <0).sum()) #deleted columns
# print((df['DaemonRestarts '] <0).sum()) #deleted columns
if ((df['StatisticsServerRestarts'] <0).sum() != 0):print((df['StatisticsServerRestarts'] <0).sum(),'values not in the domain of definition')
if ((df['ColumnUnloads'] <0).sum() != 0):print((df['ColumnUnloads'] <0).sum(),'values not in the domain of definition')
if ((df['DeltaSize'] <0).sum() != 0):print((df['DeltaSize'] <0).sum(),'values not in the domain of definition')
if ((df['BlockingPhaseSec'] <0).sum() != 0):print((df['BlockingPhaseSec'] <0).sum(),'values not in the domain of definition')
if ((df['LargestTableSize'] <0).sum() != 0):print((df['LargestTableSize'] <0).sum(),'values not in the domain of definition')
if ((df['LargestPartitionSize'] <0).sum() != 0):print((df['LargestPartitionSize'] <0).sum(),'values not in the domain of definition')
if ((df['DiagnosisFiles'] <0).sum() != 0):print((df['DiagnosisFiles'] <0).sum(),'values not in the domain of definition')
if ((df['DiagnosisFilesSize'] <0).sum() != 0):print((df['DiagnosisFilesSize'] <0).sum(),'values not in the domain of definition')
if ((df['DaysWithSuccessfulDataBackups'] <0).sum() != 0):print((df['DaysWithSuccessfulDataBackups'] <0).sum(),'values not in the domain of definition')
if ((df['DaysWithSuccessfulLogBackups'] <0).sum() != 0):print((df['DaysWithSuccessfulLogBackups'] <0).sum(),'values not in the domain of definition')
if ((df['DaysWithFailedDataBackups'] <0).sum() != 0):print((df['DaysWithFailedDataBackups'] <0).sum(),'values not in the domain of definition')
if ((df['DaysWithFailedfulLogBackups'] <0).sum() != 0):print((df['DaysWithFailedfulLogBackups'] <0).sum(),'values not in the domain of definition')
if ((df['MinDailyNumberOfSuccessfulDataBackups'] <0).sum() != 0):print((df['MinDailyNumberOfSuccessfulDataBackups'] <0).sum(),'values not in the domain of definition')
if ((df['MinDailyNumberOfSuccessfulLogBackups'] <0).sum() != 0):print((df['MinDailyNumberOfSuccessfulLogBackups'] <0).sum(),'values not in the domain of definition')
if ((df['MaxDailyNumberOfFailedDataBackups'] <0).sum() != 0):print((df['MaxDailyNumberOfFailedDataBackups'] <0).sum(),'values not in the domain of definition')
if ((df['MaxDailyNumberOfFailedLogBackups'] <0).sum() != 0):print((df['MaxDailyNumberOfFailedLogBackups'] <0).sum(),'values not in the domain of definition')
if ((df['LogSegmentChange'] <0).sum() != 0):print('LogSegmentChange:',(df['LogSegmentChange'] <0).sum(),'values not in the domain of definition')
#FLOAT [0;100] / count valuees out of the bound
#J'AI SÉPARÉE POUR PAS AVOIR CPU : 0 9 MAIS PLUTOT CPU : 9 DIRECT C'EST MIEUX JE TROUVE
if ((df['CPU'] <0).sum() !=0):print('CPU:',(df['CPU'] <0).sum(),'values not in the domain of definition')
if ((df['PhysMEM'] <0).sum() !=0):print('PhysMEM:',(df['PhysMEM'] <0).sum(),'values not in the domain of definition')
if ((df['InstanceMEM'] <0).sum() !=0):print('InstanceMEM:',(df['InstanceMEM'] <0).sum(),'values not in the domain of definition')
if ((df['PhysMEM'] <0).sum() !=0):print('PhysMEM:',(df['PhysMEM'] <0).sum(),'values not in the domain of definition')
if ((df['TablesAllocation'] <0).sum() !=0):print('TablesAllocation:',(df['TablesAllocation'] <0).sum(),'values not in the domain of definition')
if ((df['IndexServerAllocationLimit'] <0).sum() !=0):print('IndexServerAllocationLimit:',(df['IndexServerAllocationLimit'] <0).sum(),'values not in the domain of definition')
if ((df['Disk'] <0).sum() !=0):print('Disk:',(df['Disk'] <0).sum(),'values not in the domain of definition')
if ((df['CPU']>100).sum() != 0):print('CPU:',(df['CPU']>100).sum(), 'values not in the domain of definition')
if ((df['PhysMEM']>100).sum() != 0):print('PhysMEM:' ,(df['PhysMEM']>100).sum(),'values not in the domain of definition')
if ((df['InstanceMEM']>100).sum() != 0):print('InstanceMEM:',(df['InstanceMEM']>100).sum(),'values not in the domain of definition')
if ((df['TablesAllocation']>100).sum() != 0):print('TablesAllocation:',(df['TablesAllocation']>100).sum(),'values not in the domain of definition')
if ((df['IndexServerAllocationLimit']>100).sum() != 0):print('IndexServerAllocationLimit:',(df['IndexServerAllocationLimit']>100).sum(),'values not in the domain of definition')
if ((df['Disk']>100).sum() != 0):print('Disk:',(df['Disk']>100).sum(),'values not in the domain of definition')
df = df.drop(df[df['LogSegmentChange'] <0].index)
df = df.drop(df[df['CPU']>100].index)
df = df.drop(df[df['PhysMEM']>100].index)
df = df.drop(df[df['Disk']>100].index)
ones=df.iloc[:,33:].sum().values #contains number of time each check is set to 1
zeros=len(df.iloc[:,33:])-df.iloc[:,33:].sum().values #contains number of time each check is set to 0
checks=df.columns[33:].values #contain names of each checks
# Create a DataFrame which contains the name of the checks and the number of 1 and 0 associated
numbers = {'Checks': checks,'NumberOfOnes':ones, 'NumberOfZeros':zeros}
df_nbCheck= pd.DataFrame(numbers)
#set index for the plot
df_nbCheck = df_nbCheck.set_index('Checks')
#plot
ax=df_nbCheck.plot.bar(stacked=True, figsize=(15,7))
#text labels showing the real number of 1 and 0
rects = ax.patches
labels1 = ones
labels0=zeros
for rect, label1, label0 in zip(rects, labels1, labels0):
height = rect.get_height()
ax.text(rect.get_x() + rect.get_width()/2, height + 5, label1, ha='center', va='bottom')
ax.text(rect.get_x() + rect.get_width()/2, 175000-height, label0, ha='center', va='bottom')
#create lists with number of check counts as key and number of rows having this number as values
rowSums = df.iloc[:,36:].sum(axis=1)
multiLabel_counts = rowSums.value_counts().to_dict()
#We add the Checks with NaN values if we they were already dropped from the dataframe
multiLabel_counts['OnlyNans']=len(nans)
multiLabel_counts[0]=multiLabel_counts[0]-multiLabel_counts['OnlyNans']
#plot
sns.set()
plt.figure(figsize=(15,7))
ax = sns.barplot(list(multiLabel_counts.keys()), list(multiLabel_counts.values()))
plt.title("Sessions having multiple types of checks ")
plt.ylabel('Number of sessions')
plt.xlabel('Number of checks')
#adding the text labels
rects = ax.patches
labels = multiLabel_counts.values()
for rect, label in zip(rects, labels):
height = rect.get_height() #number of sessions
ax.text(rect.get_x() + rect.get_width()/2, height + 5, label, ha='center', va='bottom')
plt.show()
#create a list of the different check labels
categories = list(df.columns.values)
datas=categories[:33]
categories = categories[33:]
#Correlation map to see how checks are correlated between one another
corrmat = df[categories].corr() #NB: correlations exclude NA values
plt.subplots()
sns.heatmap(corrmat, vmax=0.9, square=True, annot=True, annot_kws={"size": 10, "rotation": 45})
plt.show()
#Correlation map to see how the attributes are correlated to the labels (Checks)
corrmat = df.corr()
corr1=corrmat.iloc[32:,:32]
plt.subplots(figsize=(24,22))
sns.heatmap(corr1, vmax=0.9, square=True, cbar_kws={"shrink": 0.18}, annot=True, annot_kws={"size": 10, "rotation": 45})
plt.show()
#plot graphs representing the most correlated columns to each check
plt.figure(1,figsize=(20,20))
for i in range(1, 9):
plt.subplot(3 , 3 , i)
plt.subplots_adjust(hspace=0.9)
corr=df.corr()['Check%s'%i]
corr_abs = corr.apply(np.abs)
corr_abs.sort_values(ascending=False,inplace=True)
corr_abs_top_ten = corr_abs[1:11]
plt.plot(corr_abs_top_ten)
plt.xticks(rotation=90)
plt.title('Check%s'%i,fontsize=15)
# CHECK1
df_check1 = df[['CPU','Check1']].dropna()
# We reinitialize the index of the dataframe
df_check1.reset_index(inplace=True)
#We Create a column called Session containing the value of the index
df_check1['Session']=df_check1.index
#We reorganize the colmuns
df_check1 = df_check1[['Session','CPU','Check1']]
#PLOT
df_check1.plot.scatter(x='Session', y='CPU', c='Check1',colormap='viridis', figsize=(25,8), grid=True)
plt.show()
#CHECK2 AND CHECK 4
plt.figure(figsize=(15,7))
df_check2_4 = df[['Check2','Check4','InstanceMEM','IndexServerAllocationLimit']]
df_check2_4.reset_index(inplace=True)
df_check2_4['Session']=df_check2_4.index
df_check2_4 = df_check2_4[['Session','Check2','Check4','InstanceMEM','IndexServerAllocationLimit']]
#Plot of the different combinations between Check2 and Check4
conditions = [
(df_check2_4['Check2'] == 0.0) & (df_check2_4['Check4'] == 0.0),
(df_check2_4['Check2'] == 0.0) & (df_check2_4['Check4'] == 1.0),
(df_check2_4['Check2'] == 1.0) & (df_check2_4['Check4'] == 0.0),
(df_check2_4['Check2'] == 1.0) & (df_check2_4['Check4'] == 1.0)]
choices = ['(0,0)', '(0,1)','(1,0)', '(1,1)']
df_check2_4['combination'] = np.select(conditions, choices, default='black')
fig, ax1 = plt.subplots(figsize=(15,8))
plt.title("Combinations of Check2 and Check4", fontsize=15)
graph = sns.countplot(ax=ax1,x='combination',data=df_check2_4)
graph.set_xticklabels(graph.get_xticklabels())
for p in graph.patches:
height = p.get_height()
graph.text(p.get_x()+p.get_width()/2., height+5 ,height,ha="center",va='bottom')
plt.show()
df_check2_4.plot.scatter(x='Session', y='InstanceMEM', c='Check2',colormap='viridis', figsize=(25,8), grid=True)
df_check2_4.plot.scatter(x='Session', y='IndexServerAllocationLimit', c='Check4',colormap='viridis', figsize=(25,8), grid=True)
plt.show()
#CHECK 3
df_check3 = df[['Check3','PhysMEM','LargestPartitionSize','LargestTableSize','IndexServerAllocationLimit']]
df_check3.dropna(inplace=True)
df_check3.reset_index(inplace=True)
df_check3['Session']=df_check3.index
df_check3 = df_check3[['Session','Check3','PhysMEM','LargestPartitionSize','LargestTableSize','IndexServerAllocationLimit']]
df_check3.plot.scatter(x='Session', y='PhysMEM', c='Check3',colormap='viridis', figsize=(25,8), grid=True)
plt.show()
#Check5
df_check5 = df[['Check2','Check4','InstanceMEM','IndexServerAllocationLimit','Check5','TablesAllocation']]
df_check5.dropna(inplace=True)
df_check5.reset_index(inplace=True)
df_check5['Session']=df_check5.index
df_check5 = df_check5[['Session','Check2','Check4','InstanceMEM','IndexServerAllocationLimit','Check5','TablesAllocation']]
conditions = [
(df_check5['Check2'] == 0.0) & (df_check5['Check4'] == 0.0)&(df_check5['Check5'] == 0.0),
(df_check5['Check2'] == 0.0) & (df_check5['Check4'] == 0.0)&(df_check5['Check5'] == 1.0),
(df_check5['Check2'] == 0.0) & (df_check5['Check4'] == 1.0)&(df_check5['Check5'] == 0.0),
(df_check5['Check2'] == 0.0) & (df_check5['Check4'] == 1.0)&(df_check5['Check5'] == 1.0),
(df_check5['Check2'] == 1.0) & (df_check5['Check4'] == 0.0)&(df_check5['Check5'] == 0.0),
(df_check5['Check2'] == 1.0) & (df_check5['Check4'] == 0.0)&(df_check5['Check5'] == 1.0),
(df_check5['Check2'] == 1.0) & (df_check5['Check4'] == 1.0)&(df_check5['Check5'] == 0.0),
(df_check5['Check2'] == 1.0) & (df_check5['Check4'] == 1.0)&(df_check5['Check5'] == 1.0)
]
choices = ['(0,0,0)', '(0,0,1)','(0,1,0)', '(0,1,1)','(1,0,0)', '(1,0,1)','(1,1,0)', '(1,1,1)']
df_check5['combinaison'] = np.select(conditions, choices, default='black')
fig, ax1 = plt.subplots(figsize=(15,8))
graph = sns.countplot(ax=ax1,x='combinaison',data=df_check5)
graph.set_xticklabels(graph.get_xticklabels())
for p in graph.patches:
height = p.get_height()
graph.text(p.get_x()+p.get_width()/2., height+5 ,height ,ha="center",va='bottom')
plt.title("Distribution of the different possible combinations between check 2, 4 and 5")
df_check5.plot.scatter(x='Session', y='TablesAllocation', c='Check5',colormap='viridis', figsize=(25,8), grid=True)
plt.title("TablesAllocation value and its associated check5 value among the sessions")
plt.show()
#Check6
df_check6 = df[['Check6','IndexServerAllocationLimit','HighPriorityAlerts','TablesAllocation']]
df_check6.dropna(inplace=True)
df_check6.reset_index(inplace=True)
df_check6['Session']=df_check6.index
df_check6 = df_check6[['Session','Check6','IndexServerAllocationLimit','HighPriorityAlerts','TablesAllocation']]
df_check6.plot.scatter(x='Session', y='IndexServerAllocationLimit', c='Check6',colormap='viridis', figsize=(25,8), grid=True)
plt.title("IndexServerAllocationLimit with their check6 values among the Sessions")
df_check6.plot.scatter(x='Session', y='HighPriorityAlerts', c='Check6',colormap='viridis', figsize=(25,8), grid=True)
plt.title("HighPriorityAlerts with their check6 values among the Sessions")
df_check6.plot.scatter(x='Session', y='TablesAllocation', c='Check6',colormap='viridis', figsize=(25,8), grid=True)
plt.title("TablesAllocation with their check6 values among the Sessions")
#correlation between the 3 columns, for check6=1
df_check6.sort_values(by=['Check6','HighPriorityAlerts'],ascending=False,inplace=True)
a=df_check6[df_check6['Check6']==1.0]
a.plot.scatter(x='HighPriorityAlerts', y='IndexServerAllocationLimit', c='TablesAllocation',colormap='Reds', alpha=0.5, figsize=(25,8), grid=True)
plt.title("Correlation between the 3 columns, for check6=1")
#correlation between the 3 columns, for check6=0
b=df_check6[df_check6['Check6']==0.0]
b.plot.scatter(x='HighPriorityAlerts', y='IndexServerAllocationLimit', c='TablesAllocation',colormap='Reds', alpha=0.5, figsize=(25,8), grid=True)
plt.title("Correlation between the 3 columns, for check6=0")
plt.show()
#Check 7
df_check7 = df[['Check7','LogSegmentChange','HighPriorityAlerts']]
df_check7.dropna(inplace=True)
df_check7.reset_index(inplace=True)
df_check7['Session']=df_check7.index
df_check7 = df_check7[['Session','Check7','LogSegmentChange', 'HighPriorityAlerts']]
df_check7.plot.scatter(x='Session', y='LogSegmentChange', c='Check7',colormap='viridis', figsize=(25,8), grid=True)
plt.title("LogSegmentChange value and its associated check7 among the sessions")
plt.show()
plt.scatter(df['Check7'], df['LogSegmentChange'])
plt.title("LogSegmentChange according to the check7 value")
plt.xlabel("Check7 value")
plt.ylabel("LogSegmentChange")
plt.show()
#take a look at the corrrelation between the second most correlated column and check7
a=df_check7.sort_values(by=['HighPriorityAlerts','Check7'],ascending=False)
sns.countplot(x='HighPriorityAlerts',hue='Check7',data=a, log=True)
plt.title("Number of rows with check7 equal to 0 (blue) or 1 (orange) according to their number of high priority alert")
plt.show()
#Check 8
df_check8 = df[['Check8','NameServerRestarts']]
df_check8.dropna(inplace=True)
df_check8.reset_index(inplace=True)
df_check8['Session']=df_check8.index
df_check8 = df_check8[['Session','Check8','NameServerRestarts']]
df_check8.plot.scatter(x='Session', y='NameServerRestarts', c='Check8',colormap='viridis', figsize=(25,8), grid=True)
plt.title("Number of NameServer restarts and the check8 value associated, according to the sessions")
plt.show()
#another plot, with sorted values according to the check8 value
df_check8.sort_values(by=['Check8','NameServerRestarts'],ascending=False,inplace=True)
df_check8.reset_index(inplace=True)
df_check8['Session']=df_check8.index
df_check8[0:15000].plot.scatter(x='Session', y='NameServerRestarts', c='Check8',colormap='viridis', figsize=(25,8), grid=True)
plt.title("Number of NameServer restarts according to the sessions sorted regarding the value of check8")
plt.show()
#create a dataframe containing the id, the date, the associated month, year and week number
df_date=pd.DataFrame()
df_date['Id']=df['SessionNumber']
df_date['Date']=pd.to_datetime(df.Date, dayfirst=True)
df_date['year'] = pd.DatetimeIndex(df_date['Date']).year
df_date['month'] = pd.DatetimeIndex(df_date['Date']).month
df_date['month_year'] = pd.to_datetime(df_date['Date']).dt.to_period('M')
df_date['Week_Number'] = df_date['Date'].dt.week
df_date.head()
#add the checks to this dataframe and drop the date and month_year columns (not computable by the correlation matrix)
df_date_checks=pd.concat([df_date, df.iloc[:,33:]], axis=1, sort=False)
df_date_checks.drop(columns=['Date', 'month_year'], inplace=True)
# max(df_date[df_date['year']==2017]['Week_Number'])
# df_date.loc[df_date['year'] == 2018, 'Week_Number'] = df_date['Week_Number']+52
df_date.month_year.unique()
#plot correlation between the week_number and the session number
plt.figure(figsize=(8,5))
plt.scatter(df_date['Id'],df_date['Week_Number'])
plt.title("week number according to the session number")
plt.xlabel("Id")
plt.ylabel("Week_Number")
plt.show()
corrmat = df_date_checks.corr()
corr1=corrmat.iloc[:4,4:]
plt.subplots(figsize=(7,5))
sns.heatmap(corr1, vmax=0.9, square=True, cbar_kws={"shrink": 0.6}, annot=True, annot_kws={"rotation": 45, "size":10})
plt.yticks(rotation=0)
plt.show()
#we can get rid of the column 'Date'
df.drop(columns='Date', inplace=True)
import time
from sklearn.model_selection import train_test_split
#Logistic Regression
from sklearn.linear_model import LogisticRegression
# Random Forest
from sklearn.ensemble import RandomForestClassifier
# Performance metric
from sklearn.metrics import f1_score
#confusion matrix
from sklearn.metrics import confusion_matrix
#Normalization
from sklearn import preprocessing
#Oversampling
from imblearn.over_sampling import SMOTE
df_imbalanced = df.copy()
#Shuffle the dataframe
df_imbalanced =df_imbalanced.sample(frac=1, random_state=42)
#We Delete the colmun we don't need
df_imbalanced.drop(['SessionNumber','SystemID','Check1','Check2','Check3','Check4','Check5','Check6','Check8'],axis=1,inplace=True)
#Split the features from the target (Check1)
imbalanced_target = df_imbalanced['Check7']
imbalanced_features = df_imbalanced.drop(['Check7'],axis=1)
#We normalize
x = imbalanced_features.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
imbalanced_features = pd.DataFrame(x_scaled)
#WE SPLIT THE OVESAMPLED DATASET
X_train_imbalanced, X_test_imbalanced, y_train_imbalanced, y_test_imbalanced = train_test_split(imbalanced_features, imbalanced_target, test_size=0.2, random_state=42)
# Turn the values into an array for feeding the classification algorithms.
X_train_imbalanced = X_train_imbalanced.values
X_test_imbalanced = X_test_imbalanced.values
y_train_imbalanced = y_train_imbalanced.values
y_test_imbalanced = y_test_imbalanced.values
#We fit the model
#WE APPLY THE LINEAR REGRESSION ON THE OVERSAMPLED TRAINING SET
model_imbalanced = LogisticRegression()
model_imbalanced.fit(X_train_imbalanced,y_train_imbalanced)
#WE PREDICT
predictions_imbalanced = model_imbalanced.predict(X_test_imbalanced)
cnf_matrix = confusion_matrix(y_test_imbalanced,predictions_imbalanced)
ax = fig.add_subplot(3,3,i)
plt.subplots_adjust(hspace=0.5)
sns.heatmap(cnf_matrix,annot=True,linewidths=1)
plt.title("Confusion_matrix of Check")
plt.xlabel("Predicted_class")
plt.ylabel("Real class")
print("the recall for this model is :",cnf_matrix[1,1]/(cnf_matrix[1,1]+cnf_matrix[1,0]))
print("the precision for this model is :",cnf_matrix[1,1]/(cnf_matrix[1,1]+cnf_matrix[0,1]))
print("True Positive",cnf_matrix[1,1]) # no of anomalies which are predicted as anomalies
print("True Negative",cnf_matrix[0,0]) # no. of normal data which are predited as normal
print("False Positive",cnf_matrix[0,1]) # no of normal data which are predicted as anomalies
print("False Negative",cnf_matrix[1,0]) # no of anomalies which are predicted as normal
plt.show()
# WE COMPUTE THE F1 SCORE
print("F1 score :",f1_score(y_test_imbalanced,predictions_imbalanced,average='macro'))
#We create a copy of the original Dataframe
df1 = df.copy()
df1.shape
print('Distribution of the Check1 in the dataset :')
print(df1['Check1'].value_counts()/len(df1))
sns.countplot('Check1', data=df1)
plt.title('Distribution of Check1', fontsize=14)
plt.show()
#SHUFFLE THE DATAFRAME
df1 =df1.sample(frac=1, random_state=42)
#We keep the first 20 000 rows to use as test data
df_test = df1[:20000]
#We will try now to imbalance the remaining rows and use them as a training set
df1=df1[20000:]
print("The shape of the imbalanced test data :",df_test.shape)
print("The shape of the imbalanced training data :",df1.shape)
#We select the lines where we have a Check1 anomaly
ones_check1 = df1.loc[df['Check1'] == 1]
num_lines = ones_check1.shape[0]
#We select the normal observations
zeros=df1.loc[(df1['Check1'] ==0) & (df1['Check2'] ==0) & (df1['Check3'] ==0) & (df['Check4'] ==0) & (df1['Check5'] ==0) & (df1['Check6'] ==0) & (df1['Check7'] ==0) & (df1['Check8'] ==0)]
zeros_check1 = zeros[:num_lines]
zeros_check1.drop(['Check2','Check3','Check4','Check5','Check6','Check7','Check8'],axis=1,inplace=True)
ones_check1.drop(['Check2','Check3','Check4','Check5','Check6','Check7','Check8'],axis=1,inplace=True)
new_undersampled_df = pd.concat([ones_check1, zeros_check1])
new_undersampled_df = new_undersampled_df.sample(frac=1, random_state=42)
new_undersampled_df.head()
print('Distribution of the Classes in the subsample dataset')
print(new_undersampled_df['Check1'].value_counts()/len(new_undersampled_df))
sns.countplot('Check1', data=new_undersampled_df)
plt.title('Equally Distributed Classes', fontsize=14)
plt.show()
#Drop the first two colmuns
new_undersampled_df.drop(['SessionNumber','SystemID'],axis=1,inplace=True)
df_test.head()
#We separate the features from the targets
new_undersampled_target = new_undersampled_df['Check1']
new_undersampled_features = new_undersampled_df.drop(['Check1'],axis=1)
#We separate the features from the targets in the test set that we have apart
df_test_target = df_test['Check1']
df_test_features = df_test.drop(['SessionNumber','SystemID','Check1','Check2','Check3','Check4','Check5','Check6','Check7','Check8'],axis=1)
#Normalization of the undersampled data and the test set
x = new_undersampled_features.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
new_undersampled_features = pd.DataFrame(x_scaled)
x = df_test_features.values
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df_test_features = pd.DataFrame(x_scaled)
# Turn the values into an array for feeding the classification algorithms.
X_train_undersampled = new_undersampled_features.values
y_train_undersampled = new_undersampled_target.values
X_test_original = df_test_features.values
y_test_original = df_test_target.values
model = LogisticRegression()
#We train our model on the undersampled dataset
model.fit(X_train_undersampled,y_train_undersampled)
#We predict on the test set whish is IMBALANCED
predictions_original = model.predict(X_test_original)
cnf_matrix = confusion_matrix(y_test_original,predictions_original)
print("the recall for this model is :",cnf_matrix[1,1]/(cnf_matrix[1,1]+cnf_matrix[1,0]))
print("the precision for this model is :",cnf_matrix[1,1]/(cnf_matrix[1,1]+cnf_matrix[0,1]))
fig= plt.figure(figsize=(6,3))# to plot the graph
print("True Positive",cnf_matrix[1,1,]) # no of fraud transaction which are predicted fraud
print("True Negative",cnf_matrix[0,0]) # no. of normal transaction which are predited normal
print("False Positive",cnf_matrix[0,1]) # no of normal transaction which are predicted fraud
print("False Negative",cnf_matrix[1,0]) # no of fraud Transaction which are predicted normal
sns.heatmap(cnf_matrix,annot=True,linewidths=0.5)
plt.title("Confusion_matrix")
plt.xlabel("Predicted_class")
plt.ylabel("Real class")
plt.show()
f1 = f1_score(y_test_original,predictions_original,average='macro')
print("The F1 score using the undersampling technique :", f1)
def data_preparation(x): # preparing data for training and testing as we are going to use different data
#again and again so make a function
x_features= x.iloc[:,:len(x.columns)-1]
x_labels=x.iloc[:,len(x.columns)-1:]
x_features_train,x_features_test,x_labels_train,x_labels_test = train_test_split(x_features,x_labels,test_size=0.2)
return(x_features_train,x_features_test,x_labels_train,x_labels_test)
#We create a dictionary of dataframes, where each key refers to a dataframe with attiributes and only the key as the target
dict_df_check={}
for i in range(1,9):
oversample_df = df.copy()
arr=['Check1','Check2','Check3','Check4','Check5','Check6','Check7','Check8']
del arr[i-1]
oversample_df.drop(arr,axis=1,inplace=True)
oversample_df.drop(['SessionNumber','SystemID'],axis=1,inplace=True)
dict_df_check['check' + str(i)]=oversample_df
from imblearn.over_sampling import SMOTE
#We are using SMOTE as the function for oversampling
os = SMOTE(random_state=0)
# WE SPLIT THE DATASET TO A TRAINING SET AND TEST SET
data_train_X,data_test_X,data_train_y,data_test_y=data_preparation(dict_df_check["check2"])
columns = data_train_X.columns
start=time.time()
fig = plt.figure(figsize=(18,15))
f1_score_list = []
for i in range(1,9):
#WE SPLIT THE ORIGINAL DATASET TO A TRAINING AND TEST SET FOR EACH CHECK
#WE WILL OVERSAMPLE THE TRAINING SET AND KEEP THE TEST SET UNTIL THE END
data_train_X,data_test_X,data_train_y,data_test_y=data_preparation(dict_df_check['check'+str(i)])
columns = data_train_X.columns
#WE APPLY SMOTE
os_data_X,os_data_y=os.fit_sample(data_train_X,data_train_y)
os_data_X = pd.DataFrame(data=os_data_X,columns=columns)
os_data_y= pd.DataFrame(data=os_data_y,columns=['check'+str(i)])
#WE NORMALISE OUR OVERSAMPLED DATASET
x = os_data_X.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
os_data_X = pd.DataFrame(x_scaled)
#WE NORMALISE OUR ORIGINAL TEST SET
x = data_test_X.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
data_test_X = pd.DataFrame(x_scaled)
# Turn the values into an array for feeding the classification algorithms.
os_data_X = os_data_X.values
os_data_y = os_data_y.values
#WE APPLY THE LOGISTIC REGRESSION ON THE OVERSAMPLED TRAINING SET
model2 = LogisticRegression()
model2.fit(os_data_X,os_data_y)
#WE PREDICT
predictions = model2.predict(data_test_X)
cnf_matrix = confusion_matrix(data_test_y,predictions)
# WE PLOT THE CONFUSION MATRIX OF THE PREDICTIONS ON THE ORIGINAL TEST SET
ax = fig.add_subplot(3,3,i)
plt.subplots_adjust(hspace=0.5)
sns.heatmap(cnf_matrix,annot=True,linewidths=1)
plt.title("Confusion_matrix of Check"+str(i))
plt.xlabel("Predicted_class")
plt.ylabel("Real class")
# WE COMPUTE THE F1 SCORE
f1_score_list.append(f1_score(data_test_y,predictions,average='macro'))
for j in range(8):
print("F1 score of Check"+str(j)+":",f1_score_list[j])
print("The average F1 score :", np.mean(f1_score_list))
end=time.time()
duration_logisticRegression=end-start
print("duration of the LogisticRegression:",round(duration_logisticRegression,2),"seconds")
plt.show()
#convert dataframe to arrays of features and targets
df_values = df.drop(['SessionNumber','SystemID'],axis=1).values
features = df_values[:,:30]
targets = df_values[:,30:]
# targets.shape
x_train,x_test,y_train,y_test = train_test_split(features,targets,shuffle=True,test_size=0.20)
print("number of data in the training set: ", x_train.shape)
print("number of data in the test set: ", x_test.shape)
d=[]
s=[]
fig = plt.figure(figsize=(15,5))
for i in [1,5,10,15,20]:
start=time.time()
clf = RandomForestClassifier(n_estimators=100, max_depth=i, random_state=0)
clf.fit(x_train, y_train)
end=time.time()
duration_randomForest=end-start
d.append(duration_randomForest)
predictions = clf.predict(x_test)
a=0
for j in range(8):
a+=f1_score(y_test[:,j], predictions[:,j],average='macro')
print("f1_score average is :", a/8, "with a depth %d"%i)
duration = end-start
print("The duration time is :", round(duration,2),"seconds")
s.append(a/8)
ax1=plt.subplot(1,2,1)
ax1.plot([1,5,10,15,20], d, color='b')
plt.title("duration according to the depth")
ax2=plt.subplot(1 , 2 , 2)
ax2.plot([1,5,10,15,20], s, color='r')
plt.title("score according to the depth")
plt.show()
start=time.time()
clf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0)
clf.fit(x_train, y_train)
end=time.time()
duration_randomForest=end-start
print("most important feature for this algorithm", clf.feature_importances_)
print("duration of the Random Forest:",round(duration_randomForest,2),"seconds")
predictions = clf.predict(x_test)
from sklearn.metrics import confusion_matrix
from matplotlib import pyplot as plt
fig = plt.figure(figsize=(18,15))
for i in range(8):
conf_mat = confusion_matrix(y_true=y_test[:,i], y_pred=predictions[:,i])
labels = ['0', '1']
ax = fig.add_subplot(3,3,i+1)
plt.subplots_adjust(hspace=0.7)
sns.heatmap(conf_mat,annot=True,linewidths=1)
ax.set_xticklabels([''] + labels)
ax.set_yticklabels([''] + labels)
plt.title("confusion matrix for check %d"%(i+1), pad=15)
plt.xlabel('Predicted')
plt.ylabel('Expected')
plt.show()
a=0
for i in range(8):
print('f1_score for check%s'%(i+1), f1_score(y_test[:,i], predictions[:,i], average='macro'))
a+=f1_score(y_test[:,i], predictions[:,i],average='macro')
print("\n",'f1_score average is', a/8)
from pysbrl import BayesianRuleList
from mdlp.discretization import MDLP
def compute_intervals(mdlp_discretizer):
category_names = []
for i, cut_points in enumerate(mdlp_discretizer.cut_points_):
idxs = np.arange(len(cut_points) + 1)
names = mdlp_discretizer.assign_intervals(idxs, i)
category_names.append(names)
return category_names
def test_BayesianRuleList(features, targets):
x, y = features.values, targets
feature_names = features.columns
x_train, x_test, y_train, y_test = train_test_split(
x, y, test_size=0.33, random_state=42)
discretizer = MDLP(random_state=42).fit(x_train, y_train)
x_train_cat = discretizer.transform(x_train)
rule_list = BayesianRuleList(seed=1, feature_names=feature_names, max_rule_len=2)
rule_list.fit(x_train_cat, y_train)
print(rule_list)
x_test_cat = discretizer.transform(x_test)
print('acc: %.4f' % rule_list.score(x_test_cat, y_test))
#Correlation map to see how the attributes are correlated to check1
corrmat = df.corr()
corr_check1=corrmat.iloc[32:33,:32]
corr_check1
plt.subplots(figsize=(24,22))
sns.heatmap(corr_check1, vmax=0.9, square=True, cbar_kws={"shrink": 0.18}, annot=True, annot_kws={"size": 10, "rotation": 45})
plt.show()
#keep the 10 most correlated features for check 1
most_correlated_features1=pd.DataFrame()
for i in range(10):
name=corr_check1.iloc[0].idxmax(axis=0) #name of the most correlated column
most_correlated_features1[name]=df[corr_check1.idxmax(axis=1)[0]] #add the corresponding column
corr_check1.drop([name],axis=1,inplace=True) #drop the column in the correlation dataframe to take the following maximum value
most_correlated_features1.shape
most_correlated_features1.head()
#create an array with the check 1 values
targets_check1 = df.values[:,32:][:,0]
start=time.time()
test_BayesianRuleList(most_correlated_features1, targets_check1.astype(int))
end=time.time()
duration1=end-start
print("The Rule List lasted for :",round(duration1/60,2),"minutes")
#Correlation map to see how the attributes are correlated to check2
corrmat = df.corr()
corr_check2=corrmat.iloc[33:34,:32]
corr_check2
plt.subplots(figsize=(24,22))
sns.heatmap(corr_check2, vmax=0.9, square=True, cbar_kws={"shrink": 0.18}, annot=True, annot_kws={"size": 10, "rotation": 45})
plt.show()
#keep the 10 most correlated features for check 2
most_correlated_features2=pd.DataFrame()
for i in range(10):
name=corr_check2.iloc[0].idxmax(axis=0) #name of the most correlated column
most_correlated_features2[name]=df[corr_check2.idxmax(axis=1)[0]] #add the corresponding column
corr_check2.drop([name],axis=1,inplace=True) #drop the column in the correlation dataframe to take the following maximum value
# most_correlated_features2.shape
most_correlated_features2.head()
#create an array with the check 2 values
targets_check2 = df.values[:,32:][:,1]
start=time.time()
test_BayesianRuleList(most_correlated_features2, targets_check2.astype(int))
end=time.time()
duration2=end-start
print("The Rule List lasted for :",round(duration2/60,2),"minutes")
#Correlation map to see how the attributes are correlated to check3
corrmat = df.corr()
corr_check3=corrmat.iloc[34:35,:32]
corr_check3
plt.subplots(figsize=(24,22))
sns.heatmap(corr_check3, vmax=0.9, square=True, cbar_kws={"shrink": 0.18}, annot=True, annot_kws={"size": 10, "rotation": 45})
plt.show()
#keep the 10 most correlated features for check 3
most_correlated_features3=pd.DataFrame()
for i in range(10):
name=corr_check3.iloc[0].idxmax(axis=0) #name of the most correlated column
most_correlated_features3[name]=df[corr_check3.idxmax(axis=1)[0]] #add the corresponding column
corr_check3.drop([name],axis=1,inplace=True) #drop the column in the correlation dataframe to take the following maximum value
# most_correlated_features3.shape
most_correlated_features3.head()
#create an array with the check 3 values
targets_check3 = df.values[:,32:][:,2]
start=time.time()
test_BayesianRuleList(most_correlated_features3, targets_check3.astype(int))
end=time.time()
duration3=end-start
print("The Rule List lasted for :",round(duration3/60,2),"minutes")
#Correlation map to see how the attributes are correlated to check4
corrmat = df.corr()
corr_check4=corrmat.iloc[35:36,:32]
corr_check4
plt.subplots(figsize=(24,22))
sns.heatmap(corr_check4, vmax=0.9, square=True, cbar_kws={"shrink": 0.18}, annot=True, annot_kws={"size": 10, "rotation": 45})
plt.show()
#keep the 10 most correlated features for check 4
most_correlated_features4=pd.DataFrame()
for i in range(10):
name=corr_check4.iloc[0].idxmax(axis=0) #name of the most correlated column
most_correlated_features4[name]=df[corr_check4.idxmax(axis=1)[0]] #add the corresponding column
corr_check4.drop([name],axis=1,inplace=True) #drop the column in the correlation dataframe to take the following maximum value
# most_correlated_features4.shape
most_correlated_features4.head()
#create an array with the check 4 values
targets_check4 = df.values[:,32:][:,3]
start=time.time()
test_BayesianRuleList(most_correlated_features4, targets_check4.astype(int))
end=time.time()
duration4=end-start
print("The Rule List lasted for :",round(duration4/60,2),"minutes")
#Correlation map to see how the attributes are correlated to check5
corrmat = df.corr()
corr_check5=corrmat.iloc[36:37,:32]
corr_check5
plt.subplots(figsize=(24,22))
sns.heatmap(corr_check5, vmax=0.9, square=True, cbar_kws={"shrink": 0.18}, annot=True, annot_kws={"size": 10, "rotation": 45})
plt.show()
#keep the 10 most correlated features for check 5
most_correlated_features5=pd.DataFrame()
for i in range(10):
name=corr_check5.iloc[0].idxmax(axis=0) #name of the most correlated column
most_correlated_features5[name]=df[corr_check5.idxmax(axis=1)[0]] #add the corresponding column
corr_check5.drop([name],axis=1,inplace=True) #drop the column in the correlation dataframe to take the following maximum value
# most_correlated_features5.shape
most_correlated_features5.head()
#create an array with the check 5 values
targets_check5 = df.values[:,32:][:,4]
start=time.time()
test_BayesianRuleList(most_correlated_features5, targets_check5.astype(int))
end=time.time()
duration5=end-start
print("The Rule List lasted for :",round(duration5/60,2),"minutes")
#Correlation map to see how the attributes are correlated to check6
corrmat = df.corr()
corr_check6=corrmat.iloc[37:38,:32]
corr_check6
plt.subplots(figsize=(24,22))
sns.heatmap(corr_check6, vmax=0.9, square=True, cbar_kws={"shrink": 0.18}, annot=True, annot_kws={"size": 10, "rotation": 45})
plt.show()
#keep the 10 most correlated features for check 6
most_correlated_features6=pd.DataFrame()
for i in range(10):
name=corr_check6.iloc[0].idxmax(axis=0) #name of the most correlated column
most_correlated_features6[name]=df[corr_check6.idxmax(axis=1)[0]] #add the corresponding column
corr_check6.drop([name],axis=1,inplace=True) #drop the column in the correlation dataframe to take the following maximum value
# most_correlated_features6.shape
most_correlated_features6.head()
#create an array with the check 6 values
targets_check6 = df.values[:,32:][:,5]
start=time.time()
test_BayesianRuleList(most_correlated_features6, targets_check6.astype(int))
end=time.time()
duration6=end-start
print("The Rule List lasted for :",round(duration6/60,2),"minutes")
#Correlation map to see how the attributes are correlated to check7
corrmat = df.corr()
corr_check7=corrmat.iloc[38:39,:32]
corr_check7
plt.subplots(figsize=(24,22))
sns.heatmap(corr_check7, vmax=0.9, square=True, cbar_kws={"shrink": 0.18}, annot=True, annot_kws={"size": 10, "rotation": 45})
plt.show()
#keep the 10 most correlated features for check 7
most_correlated_features7=pd.DataFrame()
for i in range(10):
name=corr_check7.iloc[0].idxmax(axis=0) #name of the most correlated column
most_correlated_features7[name]=df[corr_check7.idxmax(axis=1)[0]] #add the corresponding column
corr_check7.drop([name],axis=1,inplace=True) #drop the column in the correlation dataframe to take the following maximum value
# most_correlated_features7.shape
most_correlated_features7.head()
#create an array with the check 7 values
targets_check7 = df.values[:,32:][:,6]
start=time.time()
test_BayesianRuleList(most_correlated_features7, targets_check7.astype(int))
end=time.time()
duration7=end-start
print("The Rule List lasted for :",round(duration7/60,2),"minutes")
#Correlation map to see how the attributes are correlated to check8
corrmat = df.corr()
corr_check8=corrmat.iloc[39:40,:32]
corr_check8
plt.subplots(figsize=(24,22))
sns.heatmap(corr_check8, vmax=0.9, square=True, cbar_kws={"shrink": 0.18}, annot=True, annot_kws={"size": 10, "rotation": 45})
plt.show()
#keep the 10 most correlated features for check 8
most_correlated_features8=pd.DataFrame()
for i in range(10):
name=corr_check8.iloc[0].idxmax(axis=0) #name of the most correlated column
most_correlated_features8[name]=df[corr_check8.idxmax(axis=1)[0]] #add the corresponding column
corr_check8.drop([name],axis=1,inplace=True) #drop the column in the correlation dataframe to take the following maximum value
# most_correlated_features8.shape
most_correlated_features8.head()
#create an array with the check 8 values
targets_check8 = df.values[:,32:][:,7]
start=time.time()
test_BayesianRuleList(most_correlated_features8, targets_check8.astype(int))
end=time.time()
duration8=end-start
print("The Rule List lasted for :",round(duration8/60,2),"minutes")
df['Check9'] = df['Check2'] + df['Check4'] + df['Check5']
#When two Checks are equal to 1
df['Check9'] = df['Check9'].replace(2.0,1.0)
#When three Checks are equal to 1
df['Check9'] = df['Check9'].replace(3.0,1.0)
#plot graphs representing the most correlated columns to Check9
# plt.figure(1,figsize=(20,20))
corr=df.corr()['Check9']
corr_abs = corr.apply(np.abs)
corr_abs.sort_values(ascending=False,inplace=True)
corr_abs_top_ten = corr_abs[1:11]
plt.plot(corr_abs_top_ten)
plt.xticks(rotation=90)
plt.title('Check9',fontsize=13)
plt.show()
df_9 = df.drop(['SessionNumber','SystemID','Check1','Check2','Check3','Check4','Check5','Check6','Check7','Check8'],axis=1)
#WE SPLIT THE ORIGINAL DATASET TO A TRAINING AND TEST SET FOR EACH CHECK
#WE WILL OVERSAMPLE THE TRAINING SET AND KEEP THE TEST SET UNTIL THE END
data_train_X,data_test_X,data_train_y,data_test_y=data_preparation(df_9)
columns = data_train_X.columns
#WE APPLY SMOTE
os_data_X,os_data_y=os.fit_sample(data_train_X,data_train_y)
os_data_X = pd.DataFrame(data=os_data_X,columns=columns)
os_data_y= pd.DataFrame(data=os_data_y,columns=['Check9'])
#WE NORMALISE OUR OVERSAMPLED DATASET
x = os_data_X.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
os_data_X = pd.DataFrame(x_scaled)
#WE NORMALISE OUR ORIGINAL TEST SET
x = data_test_X.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
data_test_X = pd.DataFrame(x_scaled)
# Turn the values into an array for feeding the classification algorithms.
os_data_X = os_data_X.values
os_data_y = os_data_y.values
#WE APPLY THE LOGISTIC REGRESSION ON THE OVERSAMPLED TRAINING SET
model2 = LogisticRegression()
model2.fit(os_data_X,os_data_y)
#WE PREDICT
predictions = model2.predict(data_test_X)
cnf_matrix = confusion_matrix(data_test_y,predictions)
# WE PLOT THE CONFUSION MATRIX OF THE PREDICTIONS ON THE ORIGINAL TEST SET
sns.heatmap(cnf_matrix,annot=True,linewidths=1)
plt.title("Confusion_matrix of Check9")
plt.xlabel("Predicted_class")
plt.ylabel("Real class")
# WE COMPUTE THE F1 SCORE
f1 = f1_score(data_test_y,predictions,average='macro')
plt.show()
print("F1 score for Check9: ", f1)